Analysis of Naval Propulsion Data

Import the CSV data file assign1_NavalData.csv for analysis, and quickly check the structure of the data.
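The import and structure check can be sketched as follows. The real call would be `read.csv("assign1_NavalData.csv")`; here a tiny simulated file stands in so the snippet is self-contained:

```r
# In the actual analysis: navalData <- read.csv("assign1_NavalData.csv")
# A small simulated file stands in here so the snippet runs anywhere.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(X1 = c(3.14, 7.15), X2 = c(9L, 21L), Y1 = c(0.99, 0.99)),
          tmp, row.names = FALSE)

navalData <- read.csv(tmp)
str(navalData)    # compact view: observations, variables, and their types
dim(navalData)    # number of rows and columns
```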

## 'data.frame':    10000 obs. of  18 variables:
##  $ X1 : num  3.14 7.15 5.14 4.16 9.3 9.3 2.09 1.14 8.21 8.21 ...
##  $ X2 : int  9 21 15 12 27 27 6 3 24 24 ...
##  $ X3 : num  8374 39007 21639 14722 72759 ...
##  $ X4 : num  1387 2678 1924 1547 3560 ...
##  $ X5 : num  7014 9116 8514 7758 9729 ...
##  $ X6 : num  60.3 332.5 175.3 113.8 644.7 ...
##  $ X7 : num  60.3 332.5 175.3 113.8 644.7 ...
##  $ X8 : num  586 822 705 653 1058 ...
##  $ X9 : int  288 288 288 288 288 288 288 288 288 288 ...
##  $ X10: num  578 687 640 610 772 ...
##  $ X11: num  1.39 2.99 2.07 1.66 4.55 4.52 1.33 1.26 3.6 3.58 ...
##  $ X12: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ X13: num  7.6 15.71 10.92 9.01 22.96 ...
##  $ X14: num  1.02 1.04 1.03 1.02 1.05 1.05 1.02 1.02 1.04 1.04 ...
##  $ X15: num  12.3 44 24.9 17.8 88.3 ...
##  $ X16: num  0.24 0.87 0.49 0.35 1.75 1.79 0.26 0.25 1.18 1.22 ...
##  $ Y1 : num  0.99 0.99 0.95 0.97 1 0.97 0.99 0.99 1 0.96 ...
##  $ Y2 : num  0.98 0.98 1 0.98 0.98 0.98 1 0.99 0.99 0.98 ...

The data is stored in a data frame with 10000 observations of 18 variables. The variables are of two data types: numeric (decimal) and integer.

The following table summarizes the features/variables in the dataset. You will also find them in the text file assign1_FeatureNames.txt. The features/variables X1 to X16 are the predictors, while Y1 and Y2 are the target response variables.

| Variable | Description |
|----------|-------------|
| X1 | Lever position (lp) |
| X2 | Ship speed (v) [knots] |
| X3 | Gas Turbine shaft torque (GTT) [kN m] |
| X4 | Gas Turbine rate of revolutions (GTn) [rpm] |
| X5 | Gas Generator rate of revolutions (GGn) [rpm] |
| X6 | Starboard Propeller Torque (Ts) [kN] |
| X7 | Port Propeller Torque (Tp) [kN] |
| X8 | HP Turbine exit temperature (T48) [C] |
| X9 | GT Compressor inlet air temperature (T1) [C] |
| X10 | GT Compressor outlet air temperature (T2) [C] |
| X11 | HP Turbine exit pressure (P48) [bar] |
| X12 | GT Compressor inlet air pressure (P1) [bar] |
| X13 | GT Compressor outlet air pressure (P2) [bar] |
| X14 | Gas Turbine exhaust gas pressure (Pexh) [bar] |
| X15 | Turbine Injection Control (TIC) [%] |
| X16 | Fuel flow (mf) [kg/s] |
| Y1 | GT Compressor decay state coefficient |
| Y2 | GT Turbine decay state coefficient |

The data is from a simulator of a naval vessel characterized by a Gas Turbine (GT) propulsion plant. You may treat the available data as if it came from a hypothetical naval vessel. The propulsion system behaviour is described by the parameters X1 to X16, as detailed above, and the target is to predict the performance decay of the GT components, namely the GT Compressor and the GT Turbine.

Task: Build the best possible linear model you can to predict both Y1 and Y2, using the training dataset assign1_NavalData.csv. Then predict Y1 and Y2 on the test dataset assign1_NavalPred.csv using your model.

Model Building

Perform Linear Regression (Y1 vs all others) & (Y2 vs all others)

Fit a linear model on Y1/Y2 vs all other variables. This is the first model, also called the FULL MODEL.
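The full-model fit can be sketched as below, using simulated data as a stand-in; on the real data the call would be along the lines of `lm(Y1 ~ . - Y2, data = navalData)`, presumably excluding the other target:

```r
set.seed(1)
# Simulated stand-in: three predictors, one of which (X2) is pure noise
n   <- 500
sim <- data.frame(X1 = runif(n), X2 = runif(n), X3 = runif(n))
sim$Y1 <- 0.95 + 0.03 * sim$X1 - 0.02 * sim$X3 + rnorm(n, sd = 0.005)

# FULL MODEL: regress the target on every remaining column
lmFit1a <- lm(Y1 ~ ., data = sim)
summary(lmFit1a)   # R-squared, F-statistic, per-coefficient p-values
```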

Model for Y1: Residual standard error: 0.006122 on 9986 degrees of freedom; Multiple R-squared: 0.8398; Adjusted R-squared: 0.8396; F-statistic: 4027 on 13 and 9986 DF; p-value: < 2.2e-16. X4 is the least significant predictor (p-value = 0.944). Note that only 13 predictors enter the fit: from the structure check above, X7 duplicates X6, while X9 and X12 are constant, so lm drops them.

Model for Y2: Residual standard error: 0.00355 on 9986 degrees of freedom; Multiple R-squared: 0.7885; Adjusted R-squared: 0.7882; F-statistic: 2864 on 13 and 9986 DF; p-value: < 2.2e-16. X2 is the least significant predictor (p-value = 0.0431).

Fit a linear model on Y1/Y2 vs all variables except the one found least significant in the previous model.
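Backward elimination of the weakest predictor is conveniently done with `update`; a sketch on simulated data, where X4 plays the role of the insignificant variable:

```r
set.seed(2)
n   <- 500
sim <- data.frame(X3 = runif(n), X4 = runif(n), X5 = runif(n))
sim$Y1 <- 1 + 0.5 * sim$X3 + 0.4 * sim$X5 + rnorm(n, sd = 0.1)  # X4 is noise

fitFull    <- lm(Y1 ~ ., data = sim)
# Refit without the least significant predictor from the previous summary
fitReduced <- update(fitFull, . ~ . - X4)
summary(fitReduced)
```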

Model for Y1: Multiple R-squared and Adjusted R-squared stay the same, as the removed variable (X4) was insignificant. The F-statistic improved to 4363 compared to the previous model, and with X4 removed, all the remaining variables are highly significant. There are no further insignificant variables to remove.

Model for Y2: Multiple R-squared decreased slightly to 0.7884 with the removal of X2, and Adjusted R-squared decreased to 0.7881, but all the remaining variables are highly significant and the F-statistic improved to 3101. There are no further insignificant variables to remove.

Checking for Non-linear Relations with Variables

Plot Y1 against the remaining variables individually to check for non-linear relations (blue). Plot Y2 against the remaining variables individually to check for non-linear relations (green).
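The per-variable scatter plots can be produced with a simple loop; a sketch on simulated data, with the plots sent to a throwaway PDF so the snippet runs non-interactively:

```r
set.seed(3)
sim <- data.frame(X8 = runif(100, 500, 1100), X15 = runif(100, 10, 90))
sim$Y1 <- 1 - 2e-5 * sim$X15^2 + rnorm(100, sd = 0.01)

pdf(tempfile(fileext = ".pdf"))   # throwaway device; drop for interactive use
for (v in c("X8", "X15")) {       # in the real data: all remaining predictors
  plot(sim[[v]], sim$Y1, col = "blue",          # green for the Y2 plots
       xlab = v, ylab = "Y1", main = paste("Y1 vs", v))
}
dev.off()
```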

Y1: The majority of the features show non-linear relations with the target, except for X1, X2 and X14. The scatter plots are skewed to the left, so a curve would fit the points much better. The most prominent non-linear relation is with X15.

Y2: The majority of the features show non-linear relations with the target, except for X1 and X14. The scatter plots are skewed to the left, so a curve would fit the points much better. The most prominent non-linear relation is with X8.

Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) as per the observations above.
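Quadratic terms are added with `I()`, which stops `^` from being interpreted as formula syntax; a sketch on simulated data with a genuinely quadratic relation:

```r
set.seed(4)
n   <- 300
sim <- data.frame(X15 = runif(n, 10, 90))
sim$Y1 <- 1 - 2e-5 * sim$X15^2 + rnorm(n, sd = 0.005)   # curved relation

fitLin  <- lm(Y1 ~ X15, data = sim)
fitQuad <- lm(Y1 ~ X15 + I(X15^2), data = sim)   # I() protects X15^2

# Compare fits: the quadratic model should show a clear jump
summary(fitLin)$adj.r.squared
summary(fitQuad)$adj.r.squared
```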

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected, to 0.8401 and 0.8398. With the added variable, the F-statistic drops slightly to 4034, and all the remaining variables are highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected, to 0.7888 and 0.7885. With the added variable, the F-statistic drops slightly to 2868, and all the remaining variables are highly significant.

X8 and X15 also show non-linearity with Y1 and Y2. Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) accordingly.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.8499 and 0.8497. Even with the added variable, the F-statistic improved slightly to 4038. However, with this addition, X3, X8 and I(X15^2) became less significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.803 and 0.8028. Even with the added variable, the F-statistic improved slightly to 2908, and all the remaining variables are highly significant.

I(X15^2) is the least significant variable in the Y1 model. Also, X10 shows non-linearity with Y2. Fit a linear model on Y1 vs the remaining variables, removing the less significant variables; fit a linear model on Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared and Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 4349. X3 and X8 remain the less significant variables in the model.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.8058 and 0.8055. With the added variable, the F-statistic drops slightly to 2762, and all the remaining variables are highly significant.

X3 is the least significant variable in the Y1 model. Also, X16 shows non-linearity with Y2. Fit a linear model on Y1 vs the remaining variables, removing the less significant variables; fit a linear model on Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared drops slightly but Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 4710, and all the remaining variables are highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.809 and 0.8087. With the added variable, the F-statistic drops slightly to 2643, and all the remaining variables are highly significant.

X16 shows non-linearity with Y1, and X3 shows non-linearity with Y2. Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared stayed the same but Adjusted R-squared drops to 0.8496. With the added variable, the F-statistic drops slightly to 4348. Additionally, I(X16^2) is a less significant variable, so it is better to stick with the previous model.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.8104 and 0.81. With the added variable, the F-statistic drops slightly to 2509, and X5 became slightly less significant; we could attempt to remove it.

X10 shows non-linearity with Y1. Also, X5 became slightly less significant in the Y2 model and could be removed. Fit a linear model on Y1 vs the remaining variables, introducing non-linear term(s) as observed above; fit a linear model on Y2 vs the remaining variables, removing the less significant variable.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.9183 and 0.9182. Even with the added variable, the F-statistic increases by a large amount to 8630, and all remaining variables are highly significant.

Model for Y2: Multiple R-squared drops slightly to 0.8103 but Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improves slightly to 2665, and all remaining variables are highly significant.

X11 shows non-linearity with Y1, and X4 shows non-linearity with Y2. Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.9212 and 0.9211. With the added variable, the F-statistic drops slightly to 8343, and all remaining variables are highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved only slightly, to 0.8104 and 0.8101, and the F-statistic drops slightly to 2510. However, the newly added variable is the least significant, and the small improvement does not justify it, so it is better to revert to the previous model.

X6 shows non-linearity with Y1, and X13 shows non-linearity with Y2. Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.9295 and 0.9294. Even with the added variable, the F-statistic improved slightly to 8779, and all remaining variables are highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.8209 and 0.8206. Even with the added variable, the F-statistic improved slightly to 2692. However, I(X16^2) became much less significant and could be removed.

X13 shows non-linearity with Y1. Also, I(X16^2) became much less significant and could be removed from the Y2 model. Fit a linear model on Y1 vs the remaining variables, introducing non-linear term(s) as observed above; fit a linear model on Y2 vs the remaining variables except the least significant one.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.9483 and 0.9482. Even with the added variable, the F-statistic improved by a large amount to 1.145e+04, and all remaining variables are highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 2859, and all remaining variables are highly significant.

X5 shows non-linearity with Y1, and X6 shows non-linearity with Y2. Fit a linear model on Y1/Y2 vs the remaining variables, introducing non-linear term(s) as observed above.

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.9497 and 0.9496. With the added variable, the F-statistic drops by a fair amount to 1.108e+04, but all remaining variables are still highly significant.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.8296 and 0.8294. Even with the added variable, the F-statistic improved slightly to 2860, and all remaining variables are highly significant.

Check for Non-linear mutual Interactions

Fit a linear model on Y1/Y2 vs the remaining variables, introducing a non-linear interaction term between the variables that previously showed the most prominent non-linear trends, judged by the improvement in R-squared and F-statistics. X11 and X5 gave the greatest improvement for Y1; X15 and X10 gave the greatest improvement for Y2.
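Interaction terms use the `:` operator in the formula; a sketch on simulated data (the real models add `X5:X11` for Y1 and `X10:X15` for Y2):

```r
set.seed(5)
n   <- 300
sim <- data.frame(X5 = runif(n), X11 = runif(n))
sim$Y1 <- 1 - 0.05 * sim$X5 * sim$X11 + rnorm(n, sd = 0.005)

# X5:X11 adds only the product term; the main effects are listed explicitly
fitInt <- lm(Y1 ~ X5 + X11 + X5:X11, data = sim)   # same terms as Y1 ~ X5 * X11
summary(fitInt)
```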

Model for Y1: Multiple R-squared and Adjusted R-squared improved as expected to 0.961 and 0.9609. Even with the added variable, the F-statistic improved slightly to 1.367e+04. X1 and X16 became less significant in the process and could be removed.

Model for Y2: Multiple R-squared and Adjusted R-squared improved as expected to 0.8403 and 0.84. Even with the added variable, the F-statistic improved slightly to 2918. X10 became less significant in the process and could be removed.

X16 is the least significant variable and can be removed from the Y1 model; likewise, X10 can be removed from the Y2 model. Fit a linear model on Y1/Y2 vs the remaining variables except the least significant one.

Model for Y1: Multiple R-squared and Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 1.446e+04. X1 is still less significant and can be considered for removal.

Model for Y2: Multiple R-squared and Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 3090, and all the remaining variables are highly significant.

X1 is the least significant variable and can be removed from the Y1 model. Fit a linear model on Y1 vs the remaining variables except the least significant one.

Model for Y1: Multiple R-squared and Adjusted R-squared stayed the same. By removing a less significant variable, the F-statistic improved slightly to 1.536e+04, and all the remaining variables are highly significant.

Check the (so far) best models more carefully: print the "summary" of your best models (so far) for Y1 & Y2.

## 
## Call:
## lm(formula = Y1 ~ X2 + X5 + X6 + X8 + X10 + X11 + X13 + X14 + 
##     X15 + I(X8^2) + I(X10^2) + I(X11^2) + I(X6^2) + I(X13^2) + 
##     I(X5^2) + X5:X11, data = navalData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0103563 -0.0024057 -0.0000111  0.0024299  0.0086610 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.402e+00  1.105e-01   48.90   <2e-16 ***
## X2           1.904e-03  1.217e-04   15.64   <2e-16 ***
## X5          -3.767e-04  1.289e-05  -29.22   <2e-16 ***
## X6          -5.575e-04  2.996e-05  -18.61   <2e-16 ***
## X8          -6.573e-04  3.813e-05  -17.24   <2e-16 ***
## X10         -8.611e-03  2.519e-04  -34.19   <2e-16 ***
## X11          1.453e+00  2.430e-02   59.82   <2e-16 ***
## X13          1.227e-01  1.276e-03   96.18   <2e-16 ***
## X14         -9.972e-01  4.085e-02  -24.41   <2e-16 ***
## X15          2.373e-04  1.405e-05   16.89   <2e-16 ***
## I(X8^2)      2.550e-07  2.410e-08   10.58   <2e-16 ***
## I(X10^2)     4.619e-06  1.804e-07   25.60   <2e-16 ***
## I(X11^2)     4.448e-02  2.303e-03   19.31   <2e-16 ***
## I(X6^2)      1.579e-06  5.129e-08   30.80   <2e-16 ***
## I(X13^2)    -2.837e-03  3.822e-05  -74.23   <2e-16 ***
## I(X5^2)      4.001e-08  1.030e-09   38.87   <2e-16 ***
## X5:X11      -1.897e-04  2.598e-06  -73.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003022 on 9983 degrees of freedom
## Multiple R-squared:  0.961,  Adjusted R-squared:  0.9609 
## F-statistic: 1.536e+04 on 16 and 9983 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Y2 ~ X1 + X3 + X4 + X6 + X8 + X11 + X13 + X14 + 
##     X15 + X16 + I(X8^2) + I(X15^2) + I(X10^2) + I(X3^2) + I(X13^2) + 
##     I(X6^2) + X10:X15, data = navalData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0101585 -0.0023576  0.0001959  0.0024032  0.0095107 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.016e-01  3.668e-02  19.129   <2e-16 ***
## X1           1.858e-02  5.135e-04  36.179   <2e-16 ***
## X3           1.924e-05  3.120e-07  61.668   <2e-16 ***
## X4           3.050e-05  2.100e-06  14.526   <2e-16 ***
## X6          -1.978e-03  4.309e-05 -45.910   <2e-16 ***
## X8          -1.353e-03  2.499e-05 -54.150   <2e-16 ***
## X11          2.907e-01  7.813e-03  37.212   <2e-16 ***
## X13         -8.593e-02  1.410e-03 -60.944   <2e-16 ***
## X14          5.405e-01  3.578e-02  15.105   <2e-16 ***
## X15          1.018e-02  3.046e-04  33.433   <2e-16 ***
## X16          9.110e-02  9.917e-03   9.186   <2e-16 ***
## I(X8^2)      5.966e-07  1.814e-08  32.891   <2e-16 ***
## I(X15^2)     2.537e-05  8.431e-07  30.092   <2e-16 ***
## I(X10^2)     1.098e-06  2.644e-08  41.518   <2e-16 ***
## I(X3^2)     -2.661e-10  1.016e-11 -26.186   <2e-16 ***
## I(X13^2)     7.932e-04  2.528e-05  31.376   <2e-16 ***
## I(X6^2)      2.652e-06  1.095e-07  24.229   <2e-16 ***
## X15:X10     -1.908e-05  5.621e-07 -33.940   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003085 on 9982 degrees of freedom
## Multiple R-squared:  0.8403, Adjusted R-squared:   0.84 
## F-statistic:  3090 on 17 and 9982 DF,  p-value: < 2.2e-16

Check the model for potential outliers

There are some visible outliers that can be removed in order to boost the models' performance.

Remove outliers and high-leverage points, then fit your best model to the cleaned dataset.
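One common recipe for this cleaning step flags points with large standardized residuals or large Cook's distance and then refits; the sketch below uses simulated data and common rule-of-thumb thresholds, since the exact thresholds used in the report are not shown:

```r
set.seed(6)
n   <- 500
sim <- data.frame(x = runif(n))
sim$y <- 2 * sim$x + rnorm(n, sd = 0.1)
sim$y[1:5] <- sim$y[1:5] + 1                    # inject a few gross outliers

fit <- lm(y ~ x, data = sim)
# Flag outliers (|standardized residual| > 3) and influential points
# (Cook's distance above the common 4/n rule of thumb)
bad <- abs(rstandard(fit)) > 3 | cooks.distance(fit) > 4 / n
sim.clean <- sim[!bad, ]
fit.clean <- lm(y ~ x, data = sim.clean)        # refit on the cleaned rows
nrow(sim.clean)                                 # fewer rows than the original n
```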

## [1] 9574
## Y1 ~ X2 + X5 + X6 + X8 + X10 + X11 + X13 + X14 + X15 + I(X8^2) + 
##     I(X10^2) + I(X11^2) + I(X6^2) + I(X13^2) + I(X5^2) + X5:X11
## 
## Call:
## lm(formula = formula(lmFit14a), data = navalData_A.clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0066254 -0.0023018 -0.0000168  0.0022993  0.0061913 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.108e+00  1.092e-01   46.78   <2e-16 ***
## X2           1.735e-03  1.223e-04   14.19   <2e-16 ***
## X5          -3.890e-04  1.258e-05  -30.92   <2e-16 ***
## X6          -5.487e-04  2.976e-05  -18.43   <2e-16 ***
## X8          -8.098e-04  3.775e-05  -21.45   <2e-16 ***
## X10         -7.681e-03  2.499e-04  -30.73   <2e-16 ***
## X11          1.501e+00  2.350e-02   63.87   <2e-16 ***
## X13          1.210e-01  1.227e-03   98.60   <2e-16 ***
## X14         -9.207e-01  3.944e-02  -23.34   <2e-16 ***
## X15          2.501e-04  1.389e-05   18.00   <2e-16 ***
## I(X8^2)      3.666e-07  2.423e-08   15.13   <2e-16 ***
## I(X10^2)     3.888e-06  1.798e-07   21.62   <2e-16 ***
## I(X11^2)     4.757e-02  2.282e-03   20.84   <2e-16 ***
## I(X6^2)      1.530e-06  5.080e-08   30.13   <2e-16 ***
## I(X13^2)    -2.873e-03  3.720e-05  -77.23   <2e-16 ***
## I(X5^2)      4.105e-08  1.005e-09   40.83   <2e-16 ***
## X5:X11      -1.946e-04  2.522e-06  -77.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002819 on 9557 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9659 
## F-statistic: 1.694e+04 on 16 and 9557 DF,  p-value: < 2.2e-16

## [1] 9571
## Y2 ~ X1 + X3 + X4 + X6 + X8 + X11 + X13 + X14 + X15 + X16 + I(X8^2) + 
##     I(X15^2) + I(X10^2) + I(X3^2) + I(X13^2) + I(X6^2) + X10:X15
## 
## Call:
## lm(formula = formula(lmFit13b), data = navalData_B.clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.006957 -0.002242  0.000209  0.002307  0.006690 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.913e-01  3.567e-02  13.773   <2e-16 ***
## X1           2.237e-02  5.140e-04  43.514   <2e-16 ***
## X3           2.120e-05  3.063e-07  69.195   <2e-16 ***
## X4           3.867e-05  2.029e-06  19.057   <2e-16 ***
## X6          -2.275e-03  4.275e-05 -53.219   <2e-16 ***
## X8          -1.513e-03  2.572e-05 -58.830   <2e-16 ***
## X11          2.480e-01  7.612e-03  32.578   <2e-16 ***
## X13         -7.938e-02  1.369e-03 -57.974   <2e-16 ***
## X14          7.555e-01  3.507e-02  21.542   <2e-16 ***
## X15          1.189e-02  3.214e-04  36.981   <2e-16 ***
## X16          7.862e-02  9.456e-03   8.314   <2e-16 ***
## I(X8^2)      6.778e-07  1.892e-08  35.827   <2e-16 ***
## I(X15^2)     2.935e-05  8.642e-07  33.956   <2e-16 ***
## I(X10^2)     1.249e-06  2.599e-08  48.062   <2e-16 ***
## I(X3^2)     -2.177e-10  1.007e-11 -21.617   <2e-16 ***
## I(X13^2)     7.324e-04  2.423e-05  30.233   <2e-16 ***
## I(X6^2)      2.261e-06  1.083e-07  20.878   <2e-16 ***
## X15:X10     -2.231e-05  5.932e-07 -37.609   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002861 on 9553 degrees of freedom
## Multiple R-squared:  0.8613, Adjusted R-squared:  0.8611 
## F-statistic:  3490 on 17 and 9553 DF,  p-value: < 2.2e-16

Both models improved significantly in terms of R-squared and F-statistic after cleaning.


Prediction of Naval Propulsion Data

Import the CSV data file assign1_NavalPred.csv and use the best models to predict Y1 and Y2. The predicted values are stored in the CSV file assign1_NavalPred_result.csv.
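The prediction step can be sketched as below, with simulated stand-ins for the fitted model and the prediction file; the real code would read assign1_NavalPred.csv and write assign1_NavalPred_result.csv:

```r
set.seed(7)
train <- data.frame(x = runif(50))
train$y <- 3 * train$x + rnorm(50, sd = 0.1)
newdat <- data.frame(x = runif(10))     # stand-in for assign1_NavalPred.csv

fit <- lm(y ~ x, data = train)
# predict() re-applies the formula terms (including I() and interaction
# terms) to the new data frame
newdat$Y1 <- predict(fit, newdata = newdat)

out <- tempfile(fileext = ".csv")       # real path: assign1_NavalPred_result.csv
write.csv(newdat, out, row.names = FALSE)
```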